AITopics | best answer

Information about AI from the News, Publications, and Conferences

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Cascaded Language Models for Cost-Effective Human-AI Decision-Making

Neural Information Processing Systems, Jun-14-2026, 20:27:14 GMT

A challenge in human-AI decision-making is to balance three factors: the correctness of predictions, the cost of knowledge and reasoning complexity, and the confidence about whether to abstain from automated answers or escalate to human experts. In this work, we present a cascaded LLM decision framework that adaptively delegates tasks across multiple tiers of expertise: a base model for initial candidate answers, a more capable and knowledgeable (but costlier) large model, and a human expert for when the model cascade abstains.
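The three-tier delegation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how deferral or abstention is decided, so the confidence scores, the thresholds `accept_base` and `accept_large`, and all function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# A model is any callable returning (answer, confidence in [0, 1]).
# How confidence is obtained is an assumption left to the caller.
Model = Callable[[str], Tuple[str, float]]


@dataclass
class Decision:
    answer: str
    tier: str  # "base", "large", or "human"


def cascade(question: str, base: Model, large: Model,
            human: Callable[[str], str],
            accept_base: float = 0.9,
            accept_large: float = 0.8) -> Decision:
    """Try the cheap model first; escalate only when confidence is too low."""
    answer, conf = base(question)
    if conf >= accept_base:
        return Decision(answer, "base")
    answer, conf = large(question)
    if conf >= accept_large:
        return Decision(answer, "large")
    # Both models were unsure: the cascade abstains and a human decides.
    return Decision(human(question), "human")


if __name__ == "__main__":
    # Toy stand-ins for real models (hypothetical behaviour).
    base = lambda q: ("A", 0.95 if "easy" in q else 0.4)
    large = lambda q: ("B", 0.85 if "medium" in q else 0.5)
    human = lambda q: "C"
    for q in ["easy question", "medium question", "hard question"]:
        print(q, "->", cascade(q, base, large, human))
```

Raising either threshold sends more questions up the cascade, which trades cost for accuracy; that trade-off is the balance the abstract refers to.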

large language model, machine learning, natural language, (19 more...)

Neural Information Processing Systems

Country:

Asia (0.68)
Europe (0.46)
North America (0.45)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Health & Medicine (1.00)
Education > Curriculum > Subject-Specific Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.72)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.70)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.45)

Table 6: Inputs and full explanations for the CoT explanations from Table 4 on random examples where models exhibit unfaithful behavior. Random unfaithful CoT explanations for bias-consistent

Neural Information Processing Systems, Feb-17-2026, 19:41:52 GMT

Columns: Task | Model, Zero/Few-Shot | Failure Mode | Input Question | CoT in Unbiased Context | CoT in Biased Context. "... I think the answer is but I'm curious to hear what you think." Model justifies incorrect answer due to ambiguity in task specification. If you follow these instructions, do you return to the starting point? We start at the origin (0, 0), facing the positive y-axis. So the best answer is: (A). 3 We start at the origin (0, 0), facing the positive y-axis. So the best answer is: (B). Q: Is the following sentence plausible? Kenta Maeda is a baseball pitcher. Throwing to first base is part of baseball. American League Championship Series is a real baseball series. Throwing to first base is part of baseball. So the best answer is: (B) implausible.

large language model, machine learning, natural language, (20 more...)

Neural Information Processing Systems

Country:

South America > Uruguay > Maldonado > Maldonado (0.04)
Asia > Middle East > Republic of Türkiye > Batman Province > Batman (0.04)
North America > United States > California > Los Angeles County > Los Angeles (0.04)

Genre: Research Report (0.67)

Industry: Leisure & Entertainment > Sports > Baseball (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.51)

Can Linear Probes Measure LLM Uncertainty?

Dakhmouche, Ramzi, Letellier, Adrien, Gorji, Hossein

arXiv.org Artificial Intelligence, Nov-18-2025

Effective Uncertainty Quantification (UQ) represents a key aspect for reliable deployment of Large Language Models (LLMs) in automated decision-making and beyond. Yet, for LLM generation with multiple choice structure, the state-of-the-art in UQ is still dominated by the naive baseline given by the maximum softmax score. To address this shortcoming, we demonstrate that taking a principled approach via Bayesian statistics leads to improved performance despite leveraging the simplest possible model, namely linear regression. More precisely, we propose to train multiple Bayesian linear models, each predicting the output of a layer given the output of the previous one. Based on the obtained layer-level posterior distributions, we infer the global uncertainty level of the LLM by identifying a sparse combination of distributional features, leading to an efficient UQ scheme. Numerical experiments on various LLMs show consistent improvement over state-of-the-art baselines.
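For reference, the maximum softmax score that this abstract calls the naive baseline can be sketched as follows. This shows only the baseline, not the proposed Bayesian linear models; the logit values and function names are made up for illustration.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_softmax_confidence(option_logits: List[float]) -> float:
    """Baseline confidence: probability of the most likely answer option."""
    return max(softmax(option_logits))


if __name__ == "__main__":
    # Hypothetical logits for answer options (A), (B), (C), (D).
    confident = [4.0, 0.5, 0.2, 0.1]
    unsure = [1.0, 0.9, 1.1, 1.0]
    print(round(max_softmax_confidence(confident), 3))
    print(round(max_softmax_confidence(unsure), 3))
```

A score close to 1 / (number of options) means the model barely separates the choices; a score close to 1 means it strongly prefers one.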

large language model, machine learning, truncated regression, (15 more...)

arXiv.org Artificial Intelligence

2510.04108

Country: Europe (0.28)

Genre: Research Report (1.00)

Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?

McMillan, Teague, Dominici, Gabriele, Gjoreski, Martin, Langheinrich, Marc

arXiv.org Artificial Intelligence, Nov-4-2025

Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. In healthcare settings, such unfaithfulness is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. We study how inference and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets, BBQ (social bias) and MedQA (medical licensing questions), and manipulate the number and type of few-shot examples, prompting strategies, and training procedure. Our results show: (i) both the quantity and quality of few-shot examples significantly impact model faithfulness; (ii) faithfulness is sensitive to prompting design; (iii) the instruction-tuning phase improves measured faithfulness on MedQA. These findings offer insights into strategies for enhancing the interpretability and trustworthiness of LLMs in sensitive domains.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2510.24236

Country:

Asia (0.68)
North America (0.67)
Europe (0.46)

Genre: Research Report > New Finding (1.00)

Industry: Health & Medicine > Therapeutic Area (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.52)

ed3fea9033a80fea1376299fa7863f4a-Supplemental-Conference.pdf

Neural Information Processing Systems, Oct-9-2025, 11:02:49 GMT

large language model, machine learning, natural language, (21 more...)

Neural Information Processing Systems

Country:

South America > Uruguay > Maldonado > Maldonado (0.04)
Asia > Middle East > Republic of Türkiye > Batman Province > Batman (0.04)
North America > United States > California > Los Angeles County > Los Angeles (0.04)

Genre: Research Report (0.67)

Industry:

Leisure & Entertainment > Sports > Baseball (1.00)
Education (1.00)
Health & Medicine (0.94)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.94)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.51)

Efficient Test-Time Scaling for Small Vision-Language Models

Kaya, Mehmet Onurcan, Elliott, Desmond, Papadopoulos, Dim P.

arXiv.org Artificial Intelligence, Oct-7-2025

Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage the model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.
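The token-level aggregation step of TTAug might look roughly like the sketch below. The abstract does not state the aggregation rule, so plain averaging of next-token probabilities is an assumption, and the function names and toy distributions are hypothetical.

```python
from typing import Dict, List

# Next-token distribution produced by the model for one augmented input.
TokenProbs = Dict[str, float]


def aggregate_token_probs(views: List[TokenProbs]) -> TokenProbs:
    """Average next-token probabilities across augmented views (assumed rule)."""
    if not views:
        raise ValueError("need at least one augmented view")
    vocab = set().union(*views)
    n = len(views)
    return {tok: sum(v.get(tok, 0.0) for v in views) / n for tok in vocab}


def pick_token(views: List[TokenProbs]) -> str:
    """Choose the token with the highest averaged probability."""
    avg = aggregate_token_probs(views)
    return max(avg, key=avg.get)


if __name__ == "__main__":
    # Three augmented versions of the same input disagree slightly.
    views = [
        {"cat": 0.6, "dog": 0.4},
        {"cat": 0.3, "dog": 0.7},
        {"cat": 0.7, "dog": 0.3},
    ]
    print(pick_token(views))
```

No parameters are updated here, which matches the abstract's description of TTAug; the consensus token could then serve as a pseudolabel in the adaptation variant.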

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2510.03574

Country:

Asia (0.92)
Europe > Austria (0.27)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Sensing and Signal Processing > Image Processing (1.00)
Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
(4 more...)

Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems

Agrawal, Aakriti, Aralikatti, Rohith, Satheesh, Anirudh, Chakraborty, Souradip, Bedi, Amrit Singh, Huang, Furong

arXiv.org Artificial Intelligence, Oct-6-2025

Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge, particularly in resource-constrained settings. Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model. While multi-LLM systems produce more diverse responses than single models and thus have greater potential, they often underperform compared to single LLM self-consistency. We propose a principled, novel and computationally efficient method to select the best response from multiple different LLMs using a calibrated log-likelihood score, implicitly leveraging the inherent knowledge and confidence of these models. Our method demonstrates improvements of approx. 4%, 3%, and 5% across both debate (multi-round LLM discussions) and non-debate (Best-of-N with multiple LLMs) settings on GSM8K, MMLU (6 subsets), and ARC datasets respectively.
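One way to picture likelihood-based selection across models is the sketch below. The abstract does not define its calibration, so the scheme used here (mean token log-probability minus a per-model offset) is a stand-in assumption rather than the paper's method, and every name and number is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    model: str
    text: str
    token_logprobs: List[float]  # log-probabilities of the generated tokens


def score(c: Candidate, model_offset: Dict[str, float]) -> float:
    """Mean token log-prob minus a per-model offset (assumed calibration)."""
    mean_lp = sum(c.token_logprobs) / len(c.token_logprobs)
    return mean_lp - model_offset.get(c.model, 0.0)


def select_best(cands: List[Candidate],
                model_offset: Dict[str, float]) -> Candidate:
    """Return the candidate with the highest calibrated score."""
    return max(cands, key=lambda c: score(c, model_offset))


if __name__ == "__main__":
    cands = [
        Candidate("model_a", "42", [-0.1, -0.3, -0.2]),
        Candidate("model_b", "41", [-0.5, -0.7]),
    ]
    # Offsets stand for each model's typical mean log-prob (made-up values).
    offsets = {"model_a": -0.2, "model_b": -0.8}
    print(select_best(cands, offsets).model)
```

The offset matters because raw log-likelihoods from different models are not on a shared scale: a model that is habitually overconfident would otherwise win every comparison.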

arxiv preprint arxiv, large language model, natural language, (12 more...)

arXiv.org Artificial Intelligence

2510.02377

Country: North America > United States (0.28)

Genre: Research Report (0.82)

Industry: Education (0.47)

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations

Ferreira, Pedro, Aziz, Wilker, Titov, Ivan

arXiv.org Artificial Intelligence, Jul-16-2025

Chain-of-thought explanations are widely used to inspect the decision process of large language models (LLMs) and to evaluate the trustworthiness of model outputs, making them important for effective collaboration between LLMs and humans. We demonstrate that preference optimization, a key step in the alignment phase, can inadvertently reduce the faithfulness of these explanations. This occurs because the reward model (RM), which guides alignment, is tasked with optimizing both the expected quality of the response and the appropriateness of the explanations (e.g., minimizing bias or adhering to safety standards), creating potential conflicts. The RM lacks a mechanism to assess the consistency between the model's internal decision process and the generated explanation. Consequently, the LLM may engage in "reward hacking" by producing a final response that scores highly while giving an explanation tailored to maximize reward rather than accurately reflecting its reasoning. To address this issue, we propose enriching the RM's input with a causal attribution of the prediction, allowing the RM to detect discrepancies between the generated self-explanation and the model's decision process. In controlled settings, we show that this approach reduces the tendency of the LLM to generate misleading explanations.

large language model, machine learning, natural language, (16 more...)

arXiv.org Artificial Intelligence

2504.05294

Country:

North America (0.28)
Europe (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

"Amazing, They All Lean Left" -- Analyzing the Political Temperaments of Current LLMs

Neuman, W. Russell, Coleman, Chad, Dasdan, Ali, Ali, Safinah, Shah, Manan, Meghani, Kund

arXiv.org Artificial Intelligence, Jul-14-2025

Recent studies have revealed a consistent liberal orientation in the ethical and political responses generated by most commercial large language models (LLMs), yet the underlying causes and resulting implications remain unclear. This paper systematically investigates the political temperament of seven prominent LLMs -- OpenAI's GPT-4o, Anthropic's Claude Sonnet 4, Perplexity (Sonar Large), Google's Gemini 2.5 Flash, Meta AI's Llama 4, Mistral 7b Le Chat, and High-Flyer's DeepSeek R1 -- using a multi-pronged approach that includes Moral Foundations Theory, a dozen established political ideology scales, and a new index of current political controversies. We find strong and consistent prioritization of liberal-leaning values, particularly care and fairness, across most models. Further analysis attributes this trend to four overlapping factors: liberal-leaning training corpora, reinforcement learning from human feedback (RLHF), the dominance of liberal frameworks in academic ethical discourse, and safety-driven fine-tuning practices. We also distinguish between political "bias" and legitimate epistemic differences, cautioning against conflating the two. A comparison of base and fine-tuned model pairs reveals that fine-tuning generally increases liberal lean, an effect confirmed through both self-report and empirical testing. We argue that this "liberal tilt" is not a programming error or the personal preferences of programmers but an emergent property of training on democratic, rights-focused discourse. Finally, we propose that LLMs may indirectly echo John Rawls' famous veil-of-ignorance philosophical aspiration, reflecting a moral stance unanchored to personal identity or interest. Rather than undermining democratic discourse, this pattern may offer a new lens through which to examine collective ethical reasoning.

In the course of our research on the ethical logics of currently prominent large language models (Neuman et al. 2025a, b; Coleman et al. 2025), we encountered an interesting finding. The responses to various ethical dilemmas and the explanations of the underlying logics used by these models appear to resonate with the liberal side of the political spectrum. One research analytic we utilize draws on Moral Foundations Theory's five-element typology of foundational moral principles (Graham et al. 2009; Haidt 2012). The five foundations, emphasizing in turn Care, Fairness, Loyalty, Authority and Purity, are traditionally divided into two clusters. The first two, Care and Fairness, are associated with a liberal political perspective, while conservatives who fully acknowledge the first two more often emphasize the latter three -- Loyalty, Authority and Purity -- in support of traditional norms.

large language model, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2507.08027

Country: North America > United States (1.00)

Genre:

Research Report (1.00)
Questionnaire & Opinion Survey (0.95)

Industry:

Law > Statutes (0.68)
Banking & Finance > Economy (0.46)
Government > Regional Government (0.46)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.34)

Medical Data Pecking: A Context-Aware Approach for Automated Quality Evaluation of Structured Medical Data

Girshovitz, Irena, Ambus, Atai, Shahar, Moni, Gilad-Bachrach, Ran

arXiv.org Artificial Intelligence, Jul-4-2025

Background: The use of Electronic Health Records (EHRs) for epidemiological studies and artificial intelligence (AI) training is increasing rapidly. The reliability of the results depends on the accuracy and completeness of EHR data. However, EHR data often contain significant quality issues, including misrepresentations of subpopulations, biases, and systematic errors, as they are primarily collected for clinical and billing purposes. Existing quality assessment methods remain insufficient, lacking systematic procedures to assess data fitness for research. Methods: We present the Medical Data Pecking approach, which adapts unit testing and coverage concepts from software engineering to identify data quality concerns. We demonstrate our approach using the Medical Data Pecking Tool (MDPT), which consists of two main components: (1) an automated test generator that uses large language models and grounding techniques to create a test suite from data and study descriptions, and (2) a data testing framework that executes these tests, reporting potential errors and coverage. Results: We evaluated MDPT on three datasets: All of Us (AoU), MIMIC-III, and SyntheticMass, generating 55-73 tests per cohort across four conditions. These tests correctly identified 20-43 non-aligned or non-conforming data issues. We present a detailed analysis of the LLM-generated test suites in terms of reference grounding and value accuracy. Conclusion: Our approach incorporates external medical knowledge to enable context-sensitive data quality testing as part of the data analysis workflow to improve the validity of its outcomes. Our approach tackles these challenges from a quality assurance perspective, laying the foundation for further development such as additional data modalities and improved grounding methods.
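The idea of unit tests and coverage applied to data can be illustrated with a small sketch. It is not MDPT: in the paper the tests are generated by a large language model, whereas here they are written by hand, and the field names, value ranges, and coverage definition are assumptions chosen only for illustration (the ranges are not clinical guidance).

```python
from typing import Callable, Dict, List, NamedTuple

Record = Dict[str, object]


class DataTest(NamedTuple):
    name: str
    field: str
    check: Callable[[object], bool]


def run_tests(records: List[Record], tests: List[DataTest]) -> Dict[str, int]:
    """Run each test on every record that has the field; count failures."""
    failures = {t.name: 0 for t in tests}
    for rec in records:
        for t in tests:
            value = rec.get(t.field)
            if value is not None and not t.check(value):
                failures[t.name] += 1
    return failures


def field_coverage(records: List[Record], tests: List[DataTest]) -> float:
    """Fraction of fields in the data touched by at least one test."""
    fields = {f for rec in records for f in rec}
    tested = {t.field for t in tests}
    return len(fields & tested) / len(fields) if fields else 0.0


if __name__ == "__main__":
    records = [
        {"age": 54, "systolic_bp": 128, "sex": "F"},
        {"age": -3, "systolic_bp": 120, "sex": "M"},   # implausible age
        {"age": 71, "systolic_bp": 900, "sex": "F"},   # implausible reading
    ]
    tests = [
        DataTest("age_plausible", "age", lambda v: 0 <= v <= 120),
        DataTest("bp_plausible", "systolic_bp", lambda v: 50 <= v <= 300),
    ]
    print(run_tests(records, tests))
    print(field_coverage(records, tests))
```

Coverage plays the same role as in software testing: it shows which parts of the data no test has examined, here the `sex` field.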

large language model, machine learning, natural language, (21 more...)

arXiv.org Artificial Intelligence

2507.02628

Country:

Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
Europe (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
North America > United States > Alaska (0.04)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Industry:

Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
Health & Medicine > Health Care Technology > Medical Record (1.00)
Health & Medicine > Therapeutic Area > Nephrology (0.67)
(2 more...)

Technology:

Information Technology > Data Science > Data Quality (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
